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Abstract Gaussian Process (GP) regression models typically assume that residuals are Gaussian 
and have the same variance for all observations. However, applications with input-dependent noise 
(heteroscedastic residuals) frequently arise in practice, as do applications in which the residuals 
do not have a Gaussian distribution. In this paper, we propose a GP Regression model with a 
latent variable that serves as an additional unobserved covariate for the regression. This model 
(which we call GPLC) allows for heteroscedasticity since it allows the function to have a changing 
partial derivative with respect to this unobserved covariate. With a suitable covariance function, 
our GPLC model can handle (a) Gaussian residuals with input-dependent variance, or (b) non- 
Gaussian residuals with input-dependent variance, or (c) Gaussian residuals with constant variance. 



We compare our model, using synthetic datasets, with a model proposed by Goldberg, Williams 



and Bishop ( 1998), which we refer to as GPLV, which only deals with case (a), as well as a standard 
GP model which can handle only case (c). Markov Chain Monte Carlo methods are developed for 
both modelsl. Experiments show that when the data is heteroscedastic, both GPLC and GPLV 
give better results (smaller mean squared error and negative log-probability density) than standard 
GP regression. In addition, when the residual are Gaussian, our GPLC model is generally nearly as 
good as GPLV, while when the residuals are non-Gaussian, our GPLC model is better than GPLV. 



1 Introduction 

Gaussian Process (GP) regression models are popular in the machine learning community (see. 



for example, the text by|Rasmussen and Williams (2006)), perhaps mainly because these models 



are very flexible — one can choose from many covariance functions to achieve different degrees of 
smoothness or different degrees of additive structure, the parameters of such a covariance function 
can be automatically determined from the data. However, standard GP regression models typi- 
cally assume that the residuals have i.i.d. Gaussian distributions that do not depend on the input 
covariates, though in many applications, the variances of the residuals do depend on the inputs, 
and the distributions of the residuals are not necessarily Gaussian. In this paper, we present a 
GP regression model which can deal with input-dependent residuals. This model includes a latent 
variable with a fixed distribution as an unobserved input covariate. When the partial derivative of 
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the response with respect to this unobserved covariate changes across observations, the variance of 
the residuals wih change. When the latent variable is transformed non-linearly, the residuals will 
be non-Gaussian. We call this the "Gaussian Process with a Latent Covariate" (GPLC) regression 
model. 

In Section [2] below, we give the details of this model as well as the standard GP model ("STD") 
and a model due to Goldberg, Williams and Bishop (1998), which we call the "Gaussian Process 



regression with a Latent Variance" (GPLV) model, and we discuss the relationships/equivalencies 
between these models. We describe computational methods in Section [3j and present the results of 
these models on various synthetic datasets in Section |4| 



2 The models 

We look at non-linear regression problems, where the aim is to find the association between a 
vector of covariates x and a response y using n observed pairs (xi,yi), {xn,yn), and then make 
predictions for yn+i,yn+2i ■■■ corresponding to Xn+i, Xn+2---- 

Vi = f{xi) + ei (1) 

The covariate Xi is a vector of length p, and the correspoding response yi is a scalar. 



2.1 The standard GP regression model 

In the standard Gaussian process regression model The random residuals, e^'s, are assumed to have 
i.i.d. Gaussian distributions with mean and constant variance cr^. 

Bayesian GP models assume that the noise-free function / comes from a Gaussian Process which 
has prior mean function zero and some specified covariance function. Note that a zero mean prior 
is not a requirement — we could specify a non-zero prior mean function m{x) if we have a priori 
knowledge of the mean structure. Using a zero mean prior just reflects our prior knowledge that the 
function is equally likely to be positive or negative. It doesn't mean we believe the actual function 
will have an average over its domain of zero. 

The covariance functions can be fully-specified functions, but common practice is to specify a 
covariance function with unknown hyperparameters, and then estimate the hyperparameters from 
the data. Given the values of the hyperparameters, the responses, y, in a set of cases have a 
multivariate Gaussian distribution with zero mean and a covariance matrix given by 

Cov(yi, yj) = K{xi, Xj) + Sija'^ (2) 

where 5ij = when i ^ j and da = 1, and K{xi,Xj) is the covariance function of /. Any function 
that always leads to a positive semi-definite covariance matrix can be used as a covariance function. 
One example is the squared exponential covariance function with isotropic length-scale: 

K{xi,Xj) = + 7?^exp ^- 11^' ^^^^^ ^ (3) 

Where c is a suitably chosen constant, and r],p and a are hyperparameters. ry^ is sometimes 
referred to as the "signal variance", which controls the magnitude of variation of /; o"^ is the 



2 



residual variance; p is the length scale parameter for the input covariates. We can also assign a 
different length scale parameter to each covariate, which leads to the squared exponential covariance 
function with automatic relevance determination (ARD): 

rr,) = c2 + exp 1^- g (4) 

We will use squared exponential forms of covariance function from ([s]) or Q in most of this paper. 

If the values of the hyperparameters are known, then the predictive distribution of y* for a test 
case X* based on observed values x = (xi, x„) and y = {yi, ■■■,yn) is Gaussian with the following 
mean and variance: 

E{y^\x,y,x^) = k^C~^y (5) 

Yai{y^\x,y,x^:) = v — k''C~'^k (6) 

In the equations above, k is the vector of covariances between y^, and each of yi, ■ ■ ■ ,yn, C is the 
covariance matrix of the observed y, and v is the prior variance of y*, which is Cov(y=K, y,,,) from ([2]). 

When the values of the hyperparameters (denoted as 9) are unknown and therefore have to be 
estimated from the data, we put priors on them (typically independent Gaussian priors on the loga- 
rithm of each hyperparameter) , and obtain the posterior distribution p{0\x, y) oc Af{y\0, C{6))p{9). 
The predictive mean of y can then be computed by integrating over the posterior distribution of 
the hyperparameters: 

E{y,\x,y,x,) = [ k{efC{9)-^yp{e\x,y)de (7) 
Je 

Letting £ = E'(y*|x, y, x*), the predictive variance is given by 
Var(y*|x,y,x*) = E0[Vai{y^\x,y,x^:,9)] +Yare[E{y^\x,y,x^,9)] (8) 
[v{9)-ki9fC{9)-^k{9)]p{9\x,y)d9+ [ [k{9)'^C{9)-^y - p{9\x,y)d9 

2.2 A GP regression model with a latent covariate 

In this paper, we consider adding a latent variable w into the model as an unobserved input. The 
regression equation then becomes 

yi = g{xi,Wi) + Ci- (9) 

In this setting, the latent value Wi has some known random distribution, the same for all i. If we 
view g{xi,Wi) as a function of Xi only, its value is random, due to the randomness of Wi. So g is 
not the regression function giving the expected value of y for a given value of x — that is given by 
the averge value of g over all w, which we write as f{x) = E{y\x): 

f{x)= / g{x,w)p{w)dw (10) 



where p{w) is the probability density of w. Note that ( 10 ) implies that the term (^j, which we assume 
has i.i.d. Gaussian distribution with constant variance, is not the real residual of the regression, 
since 



Ci = yi- 9{xi,Wi) ^ yi- f{xi) = e. 
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where is the true residual. We put (^j in the regression for two reasons. First, the covariance 
function for g can sometimes produce nearly singular covariance matrices, that are computationally 
non-invertible because of round-off error. Adding a small diagonal term can avoid the computational 
issue without significantly changing the properties of the covariance matrix. Secondly, the function g 
will produce a probability density function for e that has singularities at points where the derivative 
of g with respect to w is zero, which is probably not desired in most applications. Adding a jitter 
term Q smooths away such singularities. 

We will model g{x, w) using a GP with a squared exponential covariance function with ARD, 
for which the covariance between training cases i and j, with latent values Wi and Wj, is 

Cov(yi, yj) = K{{xi,Wi), {xj,Wj)) + a'^dij (11) 



We choose independent standard normals as the distributions for wi, ...,Wn- The mean for the 
Wi is chosen to be zero because the squared exponential function is stationary, and hence only the 
difference between Wi and wj matters. The variance of the Wi is fixed at 1 because the effect of a 
change of scale of Wi can be achieved instead by a change in the length scale parameter ^p+i. 

We write p{w) for the density for the vector of latent variables w, and p{9) for the prior density of 
all the hyperparameters (denoted as a vector 6). The posterior joint density for the latent variables 
and the hyperparameters is 

p{w, e\x, y) = AA(y|0, C(0, w))p{w)p{e) (12) 

where AA(o|/i, S) denotes the probability density of a multivariate Gaussian distribution with mean 
^ and covariance matrix E, evaluated at a. C{6,w) is the covariance matrix of y, which depends 
on 9 and w. 

The prediction formulas for GPLC models are similar to ([T]) and ([s]). In addition to averaging 
over the hyperparameters, we also have to average over the posterior distribution of the latent 
variables w = {wi, ...,Wn)- 

E(y.l..y...) = f I H8,.,..fC(0..)-^yp(eM^.y)i0a^. (13) 



Var(y*|x,7/,x^,) = £'6i,^[Var(?/*|x, y, x*, 6*, w)] +Yare,w[E{y^:\x,y,x^:,6,w)] (14) 
Je 

\_k{9, w, w^,)'^C{9, w)~^y — S]'^ p{w, 9\x, y)d9dwdw^ 



w Je 



where £ = i?(y*|x, y, x*) 

Note that the vector of covariances of the response in a test case with the responses in training 
cases, written as k{9,w) in (13) and (14), depends on, w*, the latent value for the test case. Since 
we do not observe w^,, we randomly draw values from the prior distribution of u)*, compute the 
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Figure 1: How GPLC can produce a non-Gaussian distribution of residuals. 

corresponding expectation or variance and take the average. Similarly, the prior variance for the 
response in a test case, written v{9, w^:) above, depends in general on w^: (though not for the squared 
exponential covariance function that we use in this paper). 

To see that this model allows residual variances to depend on x, and that the residuals can have 
non-Gaussian distributions, we compute the Taylor-expansion of g{x, w) at w = 0: 



w 

g{x, w) = g{x, 0) + g'^ix, 0)u; + —g'^x, 0) + ... 



(15) 



where ^2 and denotes the first and second order partial derivatives of g with respect to its second 
argument {w). If we can ignore the second and higher order terms, i.e. the linear approximation is 
good enough, then the response given x is Gaussian, and 



\ai[g{x,w)] « + [<72(x,0)]2Var(i/;) = [52(^,0)]^ 



(16) 



which depends on x when g'2{x,Q) depends on x (which usually is the case when g is drawn from 
a GP prior). Thus in this case, the model produces Gaussian residuals with input-dependent 



variances. 



If the high-order terms in (15) cannot be ignored, then the model will have non-Gaussian, input- 
dependent residuals. For example, consider g{x^ w) = {x + w)"^, where the second order term in w 
clearly cannot be ignored. Conditional on x, g{x, w) follows a non-central Chi-Squared distribution. 
Figure [l] illustrates that at x = 2, an unobserved normally distributed input w translates into a 
non-Gaussian output y. Note that for demonstration purposes the density curves of w and y are 
not to scale (since the scales on the x-axis and y-axis are different). 

Figure [2] illustrates how an unobserved covariate can produce heteroscedasticity. The data in the 
left plot are generated from a GP, with Xj drawn uniformly from [0,5] and Wi drawn from A^(0, 1). 
The hyperparameters of the squared exponential covariance function were set to ?? = 3, px = 0.8, 
and pu] = 3. Supposing we only observe (x,y), the data is clearly heteroscedastistic, since the 
spread of y against x changes when x changes. For instance, the spread of y looks much bigger 
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Figure 2: Heteroscedasticity produced by an unobserved covariate. The left plot shows a sample of 
X and y from the GP prior, with w not shown. The right plot shows 19 dashed curves of g{xi,Wi) 
(for the same g as on the left) where the Wi are fixed to the same value, equal to the 5?th percentile 
of the standard normal for the ith curve (i.e. 1 to 19). 



when X is around 1.8 than it is when x is near 3.5. We also notice that the distribution of the 
residuals can't be Gaussian, as, for instance, we see strong skewness near x = 5. These plots show 
that if an important input quantity is not observed, the function values based only on the observed 
inputs will in general be heteroscedastic, and non-Gaussian (even if the noise term Cj is Gaussian). 

Note that although an unobserved input quantity will create heteroscedasticity, our model can 
work well even if no such quantity really exists. The model can be seen as just using the latent 
variable as a mathematical trick, to produce changing residual variances. Whether or not there 
really exists an unobserved input quantity doesn't matter (though in practice, unobserved quantities 
often do exist). 



2.3 A GP regression model with a latent variance 



Goldberg, Williams and Bishop ( 1998 ) proposed a GP treatment of regression with input-dependent 
residuals. In their scheme, a "main" GP models the mean of the response just like a standard GP 
regression model, except the residuals are not assumed to have constant variance — a secondary 
GP is used to model the logarithm of the standard deviation of the noise, which depends on the 
input. The regression equation looks the same as in ([T]): 

Vi = f{xi) + ei (17) 

but the residuals ei, ...e„ are do not have the same variance — instead, the logarithm of the standard 
deviation Zi = \o%SD\e(xi)\ depends on Xi through: 

Zi = r{xi) + Ji (18) 
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/(x) and r(x) are both given (independent) GP priors, with zero mean and covariance functions 
Cy and Cz, which have different hyperparameters (e.g. {rjyiPy) and {r]z,Pz))- Ji is a Gaussian 



"jitter" term (see Neal, 1997) which has i.i.d. Gaussian distribution with zero mean and standard 
deviation aj (a preset constant, usuahy a very smaU number, e.g. 10 Writing x — (^i, •••^^n)? 
y = (yi, ...,2/„), 6y = {'ny,Py), 6z = {rjzjPz), and z = {zi,...,Zn), the posterior density function of 
the latent values and the hyperparameters is 

Pi9y,ez,z\x,y) oc p{y\x,z,ey)piz\x,ez)p{ey,ez) « Miy\o,Cyiey,z))M{z\o,Cz{ez))piey,ez) (i9) 

where Cy is the covariance matrix for y (for the "main" GP), Cz is the covariance matrix for 
z (for the "secondary" GP) and p{9y,0z) represents the prior density for the hyperparameters 
(typically independent Gaussian priors for their logarithms). The predictive mean can be com- 
puted in a similar fashion as the prediction of GPLC, but instead of averaging over w, we average 
over z. To compute the covariance vector k, we need the value of Zn+i, which we can sample from 

p(z„+l|zi, ...,Zn). 

Alternatively, instead of using a GP to model the logarithm of the residuals standard deviations, 
we can set the standard deviations to the absolute values of a function modeled by a GP. That is, 
we let SD{ei) = \zi\, with Zi = r{xi). So the regression model can be written as 

Ui = f{xi) + r{xi)ui (20) 

where Ui ~ A'"(0, 1). 

This is similar to modeling the log of the standard deviation with a GP, but it does allow the 
standard devaition, to be zero, whereas exp(2:j) is always positive, and it is less likely to produce 
extremely large values for the standard deviation of a residual. A more general approach is taken 



by Wilson and Ghahramani (2010), who use a parameterized function to map values modeled by a 



GP to residual variances, estimating the parameters from the data. 

In the original paper by [Goldberg et. al.] a toy example was given where the hyperparameters 
are all fixed, with only the latent values sampled using MCMC. In this paper, we will take a full 



Bayesian approach, where both the hyperparameters and the z values are sampled from (19). In 
addition, we will discuss fast computation methods for this model in Section [3j 



2.4 Relationships between GPLC and other models 

We will show in this section that GPLC can be equivalent to the a standard GP regression model 
or a GPLV model, when the covariance function is suitably specified. 

Suppose the function g{x, w) in ^ has the form 

g{xi,Wi) = h{xi) + awi (21) 



where Wi ~ A'^(0, 1). If we only observe Xi but not Wi, then (21) is a regression problem with 
unknown i.i.d. Gaussian residuals with mean and variance o"^, which is equivalent to the standard 
GP regression model ([T]), if we give a GP prior to h. If we specify a covariance function that produces 
such a g{x,w), then if we set = (or equivalently, the hyperparameter a'^ = Var(Ci) = 0), our 
GPLC model will be equivalent to the standard GP model. Below, we compute the covariance 
between training cases i and j (with latent values Wi and wj) to find out the form of the appropriate 
covariance function. 
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Let's put a GP prior with zero mean and covariance function Ki{xi, xj) on h{x). As usual, the Wi 
have independent A^(0, 1) priors. Since the values of g{x, w) are a linear combination of independent 
Gaussians, they will have a Gaussian process distribution, conditional on the hyperparameters. Now 
the covariance between cases i and j is 

Cov[g{xi,Wi),g{xj,Wj)] = E[{h{xi) + awi){h{xj) + awj)] 

= E[h{xi)h{xj)] + a'^WiWj 

= Ki{xi,Xj) + a'^WiWj (22) 

Therefore, if we put a GP prior on g(x,w) with zero mean and covariance function 

K[{xi,Wi),{xj,Wj)] = Ki{xi,Xj) + a'^WiWj (23) 

the results given by GPLC will be equivalent to standard GP regression with covariance function 
Ki. In practice, if we are willing to make the assumption that the residuals have equal variances (or 
know this as a fact), this modified GPLC model is not useful, since the complexity of handling latent 
variables computationally is unnecessary. However, consider a more general covariance function 

K[{xi,Wi), {xj,Wj)] = Ki[{xi,Wi), ixj,Wj)] + K2[{xi,Wi), (xj^Wj)] (24) 

where Ki[{xi,Wi),{xj,Wj)] = exp {-YX=ii^ik - ^jk)'^ / pI ~ i'^i ~ '^j)'^ / pI+i) is a squared expo- 
nential covariance function with ARD, and K2[{xi,Wi),{xj,Wj)] = YX=i'yk^ikXjk + "fp+iWiWj is 
a linear covariance function with ARD. Then the covariance function (|23l) can be obtained as a 



limiting case of (24), when pp+i goes to infinity in Ki and 71,..., 7p all go to zero. Therefore, 
we could use this more general model, and let the data choose whether to (nearly) fit the simpler 
standard GP model. 

Similarly, if we believe that the function g{x,w) is of the form 

g{x,w) = hi{x) + ■wh2{x) (25) 

with hi and /i2 independently having Gaussian Process priors with covariance functions Ki and 
K2, we can use a GPLC model with a covariance function of the form 

K[{xi,Wi),{xj,Wj)] = Ki{xi,Xj) + WiWjK2{xi,Xj) (26) 



Now consider the alternative GPLV model (20): if we put independent GP priors on f{xi) and 



r{xi), each with zero mean, and covariance functions Ki and K2, respectively, then model (20) is 



equivalent to the modified GPLC model above with covariance function (26). The hyperparameters 
of Ki of both models should have the same posterior distribution, as would the hyperparameter of 
i^2- Notice that the two models have different latent variables: the latent variable in GPLC, Wi, is 
the value of the ith (normalized) residual; the latent variable in GPLV is Zi = r{xi), which is plus 
or minus the standard deviation of the ith residual. 



3 Computation 

Bayesian inference for GP models is based on the posterior distribution of the hyperparameters and 
the latent variables. Unfortunately this distribution is seldom analytically tractable. We usually 
use Markov Chain Monte Carlo to sample the hyperparameters and the latent values from their 
posterior distribution. 
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3.1 Overview of methods 



Common choices of MCMC method include the classic Metropolis algorithm (Metropolis et. al 



1953) and slice sampling (Neal, 2003). The Metropolis sampler is easy to implement, but for 



high-dimensional distributions, it is generally hard to tune. We can update all the parameters at 
each iteration using a multivariate proposal distribution (e.g. N(0,D) where D is diagonal), or 
we can update one parameter at a time based on a univariate proposal distribution. Either way, 
to contract an efficient MCMC method we have to assign an appropriate value for the proposal 
standard deviation (a "tuning parameter") for each hyperparameter or latent variable so that the 
acceptance rate on each variable is neither too big nor too small (generally, between 20% and 80% 
is acceptable, though the optimal value is typically unknown). There is generally no good way to 
find out what tuning parameter value is the best for each variable other than trial and error. For 
high-dimensional problems, tuning the chain is very difficult. Using squared exponential covariance 
function, our model GPLC has D = n + p + 3 variables (including hyperparameters and latent 
variables), and the GPLV model has D = n + 2p + 2 variables. 

Slice sampling, on the other hand, although slightly more difficult to implement, is relatively 
easier to tune. It does also have tuning parameters (one can control the step-out size and the number 
of step-outs), but the performance of the chain is not very sensitive to the tuning parameters. 
Figure 2.7 of [Thompson (2011) demonstrates that step-out sizes from 1 to 1000 all lead to similar 
computation time, while a change in proposal standard deviation from 1 to 1000 for a Metropolis 
sampler can result in a MCMC which is 10 to 100 times slower. In this paper, we use univariate 
step-out slice sampling for regular CP regression models and GPLC models. For GPLV, since the 
latent values are highly correlated, regular Metropolis and slice samplers do not work well. We will 
give a modified Metropolis sampler than works better than both of these simpler samplers. 



3.2 Major computations for GP models 

Evaluating the log of the posterior probability density of a GP model is typically dominated by com- 
putating the covariance matrix, C, and finding the Cholesky decomposition of C, with complexities 
pn? and n^, respectively. 

For both standard GP models and GPLV models, updates of most of the hyperparameters require 
that the covariance matrix C be recomputed, and hence also the Cholesky decomposition (denoted 
as chol(C)). For GPLC, when the ith latent variable is updated, most of C is unchanged, except for 
the ith row and ith column. This change requires a rank-n update on the Cholesky decomposition, 
which is almost as costly as as finding the Cholesky decomposition for a new C. 

Things are slightly more complicated for GPLV, since the model consists of two CPs, with two 
covariance matrices. When one of the hyperparameters for the main GP (denoted as 9y) is changed, 
the covariance matrix for the main GP, Cy, is changed, and thus chol(Cy) has to be recomputed. 
However, Cz, the covariance matrix for the secondary GP, remain unchanged. When one of 6z, the 
hyperparameters of the secondary GP is changed, Cy (and chol(Cj;)) remain unchanged, but Cz and 
chol(C2) must be recomputed. When one of the latent values, say the ith, is changed, Cz remains 
unchanged as it only depends on x and 6z, but the ith entry on the diagonal of Cy is changed. This 



minor change to Cy requires only a rank-1 update (Sherman et. al, 1950), with complexity n 



We list the major operations for the GP models discussed in this paper in Table [T| 
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one hyperparameter 


one latent variable 


STD 


V-/ jJCl CL tiWli 


C choKC) 










-//- of mirli ODOi'^itinrm 


p-\-2 




GPLC 


Operation 


C cholfC) 


1/n of C, aU of chol(C) 


Complexity 


pn'', n'^ 


pn, n"^ 


# of such operations 


P + 3 


n 


GPLV 


Operation 


Cy, chol(Cy) or Cz, chol(C2) 


rank-1 update Cy 


Complexity 


pn'', n'^ 




^ of such operations 


2p + 2 


n 



Table 1: Major operations needed when hyperparameters and latent variables change in GP models. 



3.3 A modified Metropolis sampler for GPLV 



Neal ( 1998 ) describes a method for updating latent variables in a GP model that uses a proposal 
distribution that takes into account the correlation information. This method proposes to change 
the current latent values, z, to a z' obtained by 



z 



' = (\-c?fl'^z^aLu (27) 



where a is a small constant (a tuning parameter, typically slightly greater than zero), L is the 
lower triangular Cholesky decomposition for C2, the covariance matrix for the N(^,Cz{Q;^^ prior 
for z, and u is a random vector of i.i.d. standard normal values. The transition from z to z' is 
reversible, and leaves the prior for z invariant. Because of this, the Metropolis-Hastings acceptance 
probability for these proposals depends only on the ratio of likelihoods for z' and z. 

We will use this method to develop a sampling strategy for GPLV. Recall the unnormalized 
posterior distribution for the hyperparameters and latent values is given by 

viQy, e„ z\x, y) = M{y\0, CyiOy, z))M{z\0, C,{e,))p{9y, 9,) 

To obtain new values O'y, 9'^ and z' based on current values 9y, 9z and z, we can do the following: 

1. For each of the hyperparameters in 9y (i.e. those associated with the "main" GP), do an 
update of this hyperparameter (for instance a Metropolis or slice sampling update). Notice 
that for each of these updates we need to recompute chol(Cy), but not chol(Cz), since Cz 
does not depend on 9y. 

2. For each of the hyperparameters in 9^ (i.e. those for the "secondary" GP): 

(a) Do an update of this hyperparameter (e.g. with Metropolis or slice sampling). We need 
to recompute cho^C^) for this, but not chol(Cy), since Cy does not depend on 9z- 

(b) Update all of z with the proposal described in ([27]). We need to recompute chol(Cy) to 
do this, but not chol(C2), since Cz depends only on 9z but not z. We repeat this step 
for m times (a tuning parameter) before moving to the next hyperparameter in 9z- 

In this scheme, the hyperparameters 9y and 9z are not highly correlated and hence are relatively 
easy to sample using the Metropolis algorithm. The latent variables z are highly correlated. Because 
updating the z-values is hard, we try to update them as much as possible. Notice that Cz depends 
only on x and 9z, so a change of z will not result in a change of Cz- Hence once we update a 
component of 9z (and obtain a new Cz), it makes sense to do m > 1 updates on z before updating 
another z- hyperparameter. 
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4 Experiments 



We will compare our GPLC model with Goldberg et. al. s GPLV model, and with a standard GP 



regression model having Gaussian residuals of constant variance. 
4.1 Datasets 

We use four synthetic datasets, with one or three covariates, and Gaussian or non-Gaussian resid- 
uals, as summarized below: 



Dataset 


P 


True function 


Residual SD 


Residual distribution 


Ul 


1 


f{^) 


r{x) 


Gaussian 


U2 


1 




r{x) 


non-Gaussian 


Ml 


3 


9{x) 


s{x) 


Gaussian 


M2 


3 


9{x) 


s{x) 


non-Gaussian 



Datasets Ul and U2 both have one covariate, which is uniformly drawn from [0,1], and the true 
function is 

fix,) = [l + sin(4xi)]i-i 

For Ul, the response = f{xi) + €i is contaminated with a Gaussian noise, e^, with input- 
dependent standard deviation 

SD{e,) = r{xi) = 0.2 + 0.3 exp[-30(xi - 0.5)^] 



For U2, the resp onse has a non- Gaussian residual, u, with a location-scale extreme value distribu- 

201l[ ), with probability density (l/o-)e('^-^)/'^ exp (-e^'^"'')/'^) . 
0.5772 ... is the Euler's constant. The variance of u 



tion, EV{fi, a) (see 



Leadbetter et. al. 



The mean of oj is ^^(a;) = + o"7, where 7 
is Var(a;) = vr^u^/G. We translate and scale the EV residuals so that their mean is zero and their 
standard deviation is r{x) (same as those of e in Ul). The density curve of a EV residual with 
mean and variance 1 is shown in Figure [3| 



Extreme Value Distribution 




Figure 3: Density curve of extreme value residual with mean and variance 1. 
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Datasets Ml and M2 both have three independent, standard normal covariates, denoted as 
X = {xi,X2,x^). The true function is 



g{x) = [1 + sin(xi/1.5 + 2)f-^ - [1 + sin(2;2/2 + xg/S - 2)^-^ 

As we did for Ul and U2, we add Gaussian residuals to Ml and extreme value residuals to M2. 
For both Ml and M2, the standard deviations of these residuals depend on the input covariates as 
follows: 

s{x) = 0.1 + 0.4exp[-0.2(xi - 1)^ - 0.3(a;2 - 2)^] + 0.3exp[-0.3(2;3 + 2)^] 
4.2 Predictive performance of the methods 

For each dataset (Ul, U2, Ml, and M2), we randomly generated 10 different training sets (using 
the same program but different random seeds), each with n = 100 observations, and a test dataset 
with A'^ = 5000 observations. We obtained MCMC samples using the methods described in the 
previous section, dropping the initial 1/4 samples as burn-in, and used them to make predictions 
for the test cases. 

In order to evaluate how well each model does in terms of the mean of its predictive distribution, 
we computed the mean squared error (MSE) with respect to the true function values /(xj) as 
follows 

1 ^ 

MSE(y) = -^(y,-/(x.))^ (28) 

i=l 

where yi, ...,y]\f) are the predicted responses for test cases. We also computed the average negative 
log-probability density (NLPD) of the responses in the test cases, as follows 

N / M \ 

NLPD(y) = -^El°g M E^(2^^^^IA^^'4) (29) 
1=1 \ j=i J 

where ^p{■\|J,,a'^) denotes the probability density for N{fi,a'^), fuj^afj is the predictive mean and 
variance for test case y^*) using the hyperparameters and latent variables from the jth MCMC 
iteration, and M is the number of MCMC samples used for prediction. 

We give pairwise comparison of the MSE and the NLPD in Figures |4j [5| |6j and [7j These plots 
show that both GPLC and GPLV give smaller NLPD values than the standard CP model for all 
datasets. At least for the multivariate datasets, GPLC and GPLV also give smaller MSEs than the 
standard CP model. This shows that both GPLC and GPLV can be effective for heteroscedastic 
regression problems. 

Comparing GPLC and GPLV, we notice that for datasets with Gaussian residuals, GPLC is 
almost as good as GPLV (except for the NLPDs for Ul, where GPLV gives smaller values 8 out of 
10 times), while for non-Gaussian residuals, GPLC is the clear winner, giving MSEs and NLPDs 
that are smaller than for GPLV most of the time. The numerical MSE and NLPD values are listed 
in the Appendix. 
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Figure 4: Dataset Ul: Pairwise comparison of methods using NLPD(Left) and MSE(right) 
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Figure 5: Dataset U2: Pairwise comparison of methods using NLPD(Left) and MSE(right) 
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Figure 6: Dataset Ml: Pairwise comparison of methods using NLPD(Left) and MSE(right) 
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Figure 7: Dataset M2: Pairwise comparison of methods using NLPD(Left) and MSE(right) 
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4.3 Comparison of MCMC methods for GPLV models 



To test whether or not the modified Metropohs sampler described in Section 3.3 is effective, we 
compare it to two standard samplers, which update the latent values one by one using either the 
Metropolis algorithm or the univariate step-out slice sampler. The Metropolis algorithm is used to 
update the hyperparameters in all of the three samplers. The significant computations are listed 
in Table [2 

We adjust the tuning parameters so that the above samplers are reasonably efficient. For the slice 
sampler, we use a slice width of 1, and allow indefiite stepping-out. For the univariate Metropolis 
sampler, we adjust the standard deviations of the Gaussian proposals so that the acceptance rate 
of each parameter variable is around 50%. For the modified Metropolis sampler, we set a = 0.3 
and m = 40. The acceptance rate for Zi is around 1.4%, but since we sample Zj 40 times for each 
iteration, we have about 60% of chance to get a new value of Zi while all hyperparameters are 
updated once. 

The efficiency of an MCMC method is usually measured by the autocorrelation time, r, of values 



from the chain it produces (see Neal, 1993): 



1+2^: 



li (30) 
1=1 

where 7i is the lag-z autocorrelation. Roughly speaking, the autocorrelation time is the number of 
iterations a sampler needs to obtain another nearly uncorrelated sample. 

With an MCMC sample of size M, we can only estimate sample autocorrelations 7, up to 
i = M — 1, and since these estimates are all noisy, we usually estimate r from a more limited set 
of autocorrelations, as follows: 

k 

]ii (31) 

1=1 

where /c is a cut-off point where 7^ seems to be nearly for all i > k, 



I + 2E' 



In our experiment, we will consider the autocorrelation time of both the hyperparameters and 
the latent variables. The model has four hyperparameters {rjy, py and r]z, Pz), so it is not too difficult 
to look at all of them. But there are n = 100 latent variables, each with its own autocorrelation 
time. Instead of comparing all of them one by one, we will compare the autocorrelation time of the 
sum of the latent variables as well as the sum of the squares of the latent variables. 

Another measure of sampler efficiency is the time it takes for a sampler to reach equilibrium, 
i.e. the stationary distribution of the Markov chain. This is often referred to as the Markov chain 
"mixing time" (denoted as Tm)- Theoretical bound of mixing rate can be found for some classical 





Oy 


Oz 


z 


Modified Metropolis 


Operation 


Cy, Chol(Cy) 


C„ chol(C,) 


Cy, chol(Cy) 


# of such operations 


P+l 


P+l 


mip + 1) 


Metropolis / Slice 


Operation 


Cy, chol((7y) 


Cz, chol(C,) 


rank-1 up chol(Cj^) 


# of such operations 


P+l 


p + l 


n 



Table 2: Major operations of the MCMC methods for GPLV 
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f 






Py 




Pz 






Modified Metropolis 


3.06 


4.92 


3.09 


10.63 


0.32 


0.46 


Metropolis 


2.81 


11.21 


2.73 


27.26 


13.78 


13.92 


Slice 


13.37 


16.71 


11.85 


23.65 


37.81 


53.81 



Table 3: Autocorrelation times (adjusted for computation time) of MCMC methods for GPLV. 

algorithms, however, in practice, the mixing time of a sophisticated sampler is usually very difficult 
to determine. It is common practice to look at trace plots of log-probability density values to 
decide whether or not a chain has reached equilibrium (this is usually how the number of "burn-in" 
iterations is decided). 

The mixing time and the autocorrelation time usually agree in the sense that a sampler that 
takes less time to get a new uncorrelated sample can usually achieve equilibrium faster, and vice 
versa, though this is not always the case. 

Note both autocorrelation time r and mixing time Tjv/ take the number of iterations of the a 
Markov chain as their unit of measurement. However, the CPU time required for an iteration 
differs between samplers. For a fair comparison, we adjust r by multiplying it with the average 
CPU time per iteration. The result, which we denote as f, measures the CPU time a sampler needs 
to obtain an uncorrelated sample. To fairly compare mixing times using trace plots, we will adjust 
the number of iterations in the plots so that each entire trace takes the same amount of time. 

We run the three samplers five times, starting from the same point (the prior mean) but with 
different random seeds. The average adjusted autocorrelation times are listed in Table |3j The 
modified Metropolis sampler significantly outperforms the others at sampling the latent variables: 
it is about 50 to 100 times faster than the regular Metropolis sampler and slice sampler. For the 
hyperparameters, however, the modified Metropolis sampler gives roughly the same autocorrelation 
times as the regular Metropolis sampler does. Both of them seems to work better than slice sampler, 
but the difference is much smaller than the difference in sampling latent variables. Figure [8] shows 
selected autocorrelation plots from one of the five runs (adjusted for computation time). 

Figure [9] gives the trace plots of the three methods for one of the five runs (other runs are similar). 
The bottom plot is the trace of log-probability density values of the initial 400 iterations of the 
slice sampler. The middle plot shows the initial iterations of the regular Metropolis sampler, with 
the number adjusted to take the same time as the 400 slice sampler iterations. The top plot shows 
the initial iterations of the modified Metropolis sampler, again taking the same computation time. 

It is clear that the modified Metropolis takes the least time to mix. Starting from the prior 
mean (which seem to be a reasonable initial point), with log-probability density (LPD) value of 
approximately —13, the modified Metropolis method immediately pushes the LPD to 70 at the 
second step, and then soon declines slightly to what appears to be the equilibrium distribution. 
The other two methods move take much more time to reach this equilibrium distribution, with the 
simple Metropolis and slice sampling methods taking roughly the same amount of time. 

We conclude that the modified Metropolis is the best of these MCMC method — the fastest to 
reach equilibrium, the best at sampling latent values thereafter, and at least as good at sampling 
hyperparameters. 
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Figure 8: Selected autocorrelation plots for MCMC methods for GPLV (with horizontal scales that 
adjust for computation time). 
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Figure 9: Trace plots of log posterior density for MCMC methods for GPLV. 
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In a simpler context — financial time series where no 
response can be taken to be always zero 



■'main" GP is needed, since the mean 



Wilson and Ghahramani (2010) use the "elliptical slice 



sampling" method of Murray, et. al. (2010) to sample latent values that determine the variances 
of observations. Elliptical slice sampling is related to the modified Metropolis method above. It 
would be interesting to see how they compare in a general regression context. 
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Appendix: MSE and NLPD for each method 



and training set 



Dataset Ul 



Dataset U2 



Dataset Ml 



Dataset M2 



Training Set 


NLPD 


MSE 


REG 


GPLC 


GPLV 


REG 


GPLC 


GPLV 
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. 3364 


0.2459 


0.2480 
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0.0088 


0.0059 
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0.2374 
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0.0030 
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0.0055 
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0.4726 
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. 0186 


0.0120 


0.0201 
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. 3852 


. 3046 


0.2860 


.0121 


.0113 


0.0098 


3 


.4560 


0.3390 


0.3410 


0.0305 


0.0247 


0.0277 


4 


.4300 


. 3860 


0.3751 


0.0204 


0.0188 


. 0201 


5 


0.4817 


.4321 


0.3906 


0.0426 


0.0399 


0.0351 


6 


0.4668 


. 3603 


0.3024 


. 0364 


0.0247 


0.0163 


7 


. 4282 


0.3656 


. 3799 


. 0195 


.0172 


0.0171 


8 


0.4184 


0.3198 


0.4022 


. 0197 


0.0148 


. 0217 
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0.4161 


0.3281 


0.3817 


0.0202 
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0.0297 
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0.4253 
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0.3201 


0.0216 


0.0190 


0.0197 
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. 3736 


0.2413 


0.298 


. 0099 


0.0092 


. 0093 
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.3934 


0.2458 


0.2965 


0.0136 


0.0099 


0.0122 
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.4015 


0.2695 


0.2832 


0.0167 


0.0105 


0.0128 


4 


0.4819 


0.3636 


0.4202 


. 0391 


0.0174 


. 0489 
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0.4311 


0.3257 


0.3548 


0.0269 


0.0234 


0.0230 
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0.4348 


0.2678 


0.3140 


0.0216 


0.0165 


0.0176 


7 


0.3892 


0.2479 


.2770 


. 0102 


0.0065 


0.0083 
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. 3709 


0.3147 


0.2580 


. 0058 


0.0052 


. 0067 


9 


.4043 


0.2441 


0.2899 


. 0156 


0.0124 


0.0146 


10 


0.4718 


0.3355 


0.3923 


0.0403 


0.0223 


0.0281 
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